EmoDiff: Intensity Controllable Emotional Text-to-Speech with Soft-Label Guidance
Although current neural text-to-speech (TTS) models are able to generate
high-quality speech, intensity controllable emotional TTS is still a
challenging task. Most existing methods need external optimizations for
intensity calculation, leading to suboptimal results or degraded quality. In
this paper, we propose EmoDiff, a diffusion-based TTS model where emotion
intensity can be manipulated by a proposed soft-label guidance technique
derived from classifier guidance. Specifically, instead of being guided with a
one-hot vector for the specified emotion, EmoDiff is guided with a soft label
where the values of the specified emotion and \textit{Neutral} are set to
$\alpha$ and $1-\alpha$ respectively. The $\alpha$ here represents the emotion
intensity and can be chosen from 0 to 1. Our experiments show that EmoDiff can
precisely control the emotion intensity while maintaining high voice quality.
Moreover, diverse speech with specified emotion intensity can be generated by
sampling in the reverse denoising process.
Comment: Accepted to ICASSP202
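As a rough illustration of the soft label described in this abstract, the sketch below builds the guidance vector for a chosen emotion. It assumes the two values are an intensity alpha for the specified emotion and its complement 1 - alpha for Neutral; the function name `soft_label`, the emotion names, and the list-based representation are hypothetical and not taken from the paper.

```python
def soft_label(emotions, target, intensity, neutral="Neutral"):
    """Return a soft label over `emotions` (hypothetical helper).

    Assumption: the target emotion receives `intensity` (alpha) and
    Neutral receives 1 - alpha; every other emotion receives 0.
    """
    if not 0.0 <= intensity <= 1.0:
        raise ValueError("intensity must lie in [0, 1]")
    if target not in emotions or neutral not in emotions:
        raise ValueError("target and neutral must both be known emotions")

    label = {name: 0.0 for name in emotions}
    # Adding (rather than assigning) keeps the label valid when the
    # target is Neutral itself: it then receives the full weight 1.0.
    label[neutral] += 1.0 - intensity
    label[target] += intensity
    return [label[name] for name in emotions]


if __name__ == "__main__":
    classes = ["Neutral", "Happy", "Sad"]
    print(soft_label(classes, "Happy", 0.25))
    print(soft_label(classes, "Happy", 1.0))
```

With intensity 1 the vector reduces to the usual one-hot label, and with intensity 0 it becomes pure Neutral, which matches the range of 0 to 1 stated in the abstract.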
Mechanism and Prevention of Agglomeration/Defluidization during Fluidized-Bed Reduction of Iron Ore
The mechanisms of agglomeration and defluidization and the fluidization characteristics of iron oxide particles were investigated based on the theories of surface diffusion, interface reaction, surface nano/micro effects, and phase transformation. Moreover, a mathematical model based on force balance and the plastic-viscous flow mechanism was developed to predict high-temperature defluidization behavior, and the fluidization phase diagram was obtained. On this basis, a method for controlling defluidization was proposed, together with its inhibition mechanism. As a result, a theoretical system of agglomeration/defluidization in gas-solid fluidization was developed, providing theoretical support and a technological basis for solving defluidization in industrial fluidized-bed reactors.
Multi-Speaker Multi-Lingual VQTTS System for LIMMITS 2023 Challenge
In this paper, we describe the systems developed by the SJTU X-LANCE team for
LIMMITS 2023 Challenge, and we mainly focus on the winning system on
naturalness for track 1. The aim of this challenge is to build a multi-speaker
multi-lingual text-to-speech (TTS) system for Marathi, Hindi and Telugu. Each
of the languages has a male and a female speaker in the given dataset. In track
1, only 5 hours of data from each speaker can be selected to train the TTS model.
Our system is based on the recently proposed VQTTS that utilizes VQ acoustic
feature rather than mel-spectrogram. We introduce additional speaker embeddings
and language embeddings to VQTTS for controlling the speaker and language
information. In the cross-lingual evaluations where we need to synthesize
speech in a cross-lingual speaker's voice, we provide a native speaker's
embedding to the acoustic model and the target speaker's embedding to the
vocoder. In the subjective MOS listening test on naturalness, our system
achieves 4.77, which ranks first.
Comment: Accepted by ICASSP 2023 Special Session for Grand Challenge
VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic Feature
The mainstream neural text-to-speech (TTS) pipeline is a cascade system,
including an acoustic model (AM) that predicts acoustic feature from the input
transcript and a vocoder that generates waveform according to the given
acoustic feature. However, the acoustic feature in current TTS systems is
typically mel-spectrogram, which is highly correlated along both time and
frequency axes in a complicated way, leading to a great difficulty for the AM
to predict. Although high-fidelity audio can be generated by recent neural
vocoders from ground-truth (GT) mel-spectrogram, the gap between the GT and the
predicted mel-spectrogram from AM degrades the performance of the entire TTS
system. In this work, we propose VQTTS, consisting of an AM txt2vec and a
vocoder vec2wav, which uses self-supervised vector-quantized (VQ) acoustic
feature rather than mel-spectrogram. We redesign both the AM and the vocoder
accordingly. In particular, txt2vec basically becomes a classification model
instead of a traditional regression model while vec2wav uses an additional
feature encoder before HifiGAN generator for smoothing the discontinuous
quantized feature. Our experiments show that vec2wav achieves better
reconstruction performance than HifiGAN when using self-supervised VQ acoustic
feature. Moreover, our entire TTS system VQTTS achieves state-of-the-art
performance in terms of naturalness among all current publicly available TTS
systems.
Comment: This version has been removed by arXiv administrators because the
submitter did not have the authority to assign the license at the time of
submission
VoiceFlow: Efficient Text-to-Speech with Rectified Flow Matching
Although diffusion models in text-to-speech have become a popular choice due
to their strong generative ability, the intrinsic complexity of sampling from
diffusion models harms their efficiency. Alternatively, we propose VoiceFlow,
an acoustic model that utilizes a rectified flow matching algorithm to achieve
high synthesis quality with a limited number of sampling steps. VoiceFlow
formulates the process of generating mel-spectrograms into an ordinary
differential equation conditional on text inputs, whose vector field is then
estimated. The rectified flow technique then effectively straightens its
sampling trajectory for efficient synthesis. Subjective and objective
evaluations on both single and multi-speaker corpora showed the superior
synthesis quality of VoiceFlow compared to the diffusion counterpart. Ablation
studies further verified the validity of the rectified flow technique in
VoiceFlow.
Comment: 4 figures, 5 pages, submitted to ICASSP 202
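To make the sampling idea in this abstract concrete, the sketch below integrates an ordinary differential equation with fixed-step Euler updates. It is only a toy: in VoiceFlow the vector field is estimated by a model conditioned on text, whereas here a closed-form field that points straight at a known target stands in for it. The names `euler_sample` and `make_straight_field` are hypothetical.

```python
def euler_sample(vector_field, x0, num_steps):
    """Integrate dx/dt = vector_field(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    for k in range(num_steps):
        t = k / num_steps  # current time; always strictly below 1
        v = vector_field(x, t)
        x = [xi + vi / num_steps for xi, vi in zip(x, v)]
    return x


def make_straight_field(target):
    """Toy stand-in for a learned field: always points straight at `target`.

    Assumption: the velocity is (target - x) / (1 - t), so every
    trajectory is a straight line that ends at `target` when t = 1.
    """
    def field(x, t):
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return field


if __name__ == "__main__":
    field = make_straight_field([1.0, -2.0])
    for steps in (1, 2, 10):
        print(steps, euler_sample(field, [0.0, 0.0], steps))
```

Because the toy trajectories are perfectly straight, a single Euler step already lands on the target. This is the intuition behind straightening the sampling trajectory: the straighter the path, the fewer sampling steps are needed for a given accuracy.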
Expressive TTS Driven by Natural Language Prompts Using Few Human Annotations
Expressive text-to-speech (TTS) aims to synthesize speeches with human-like
tones, moods, or even artistic attributes. Recent advancements in expressive
TTS empower users with the ability to directly control synthesis style through
natural language prompts. However, these methods often require excessive
training with a significant amount of style-annotated data, which can be
challenging to acquire. Moreover, they may have limited adaptability due to
fixed style annotations. In this work, we present FreeStyleTTS (FS-TTS), a
controllable expressive TTS model with minimal human annotations. Our approach
utilizes a large language model (LLM) to transform expressive TTS into a style
retrieval task. The LLM selects the best-matching style references from
annotated utterances based on external style prompts, which can be raw input
text or natural language style descriptions. The selected reference guides the
TTS pipeline to synthesize speeches with the intended style. This innovative
approach provides flexible, versatile, and precise style control with minimal
human workload. Experiments on a Mandarin storytelling corpus demonstrate
FS-TTS's proficiency in leveraging LLM's semantic inference ability to retrieve
desired styles from either input text or user-defined descriptions. This
results in synthetic speeches that are closely aligned with the specified
styles.
Comment: 5 pages, 3 figures, submitted to ICASSP 202
Flashlight: Scalable Link Prediction with Effective Decoders
Link prediction (LP) has been recognized as an important task in graph
learning because of its broad practical applications. A typical application of
LP is to retrieve the top-scoring neighbors for a given source node, as in
friend recommendation. Such services require high inference scalability to
find the top-scoring neighbors among many candidate nodes at low latency.
There are two popular decoders that the recent LP models mainly use to compute
the edge scores from node embeddings: the HadamardMLP and Dot Product decoders.
After theoretical and empirical analysis, we find that the HadamardMLP decoders
are generally more effective for LP. However, HadamardMLP lacks the scalability
for retrieving top scoring neighbors on large graphs, since to the best of our
knowledge, there does not exist an algorithm to retrieve the top scoring
neighbors for HadamardMLP decoders in sublinear complexity. To make HadamardMLP
scalable, we propose the Flashlight algorithm to accelerate the top scoring
neighbor retrievals for HadamardMLP: a sublinear algorithm that progressively
applies approximate maximum inner product search (MIPS) techniques with
adaptively adjusted query embeddings. Empirical results show that Flashlight
improves the inference speed of LP by more than 100 times on the large
OGBL-CITATION2 dataset without sacrificing effectiveness. Our work paves the
way for large-scale LP applications with the effective HadamardMLP decoders by
greatly accelerating their inference.
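The two decoders named in this abstract can be sketched as follows. This is a minimal illustration, not the Flashlight algorithm: it shows a Dot Product decoder, a HadamardMLP decoder with an assumed single hidden layer and ReLU activation, and the linear-time brute-force retrieval that a sublinear method would aim to avoid. All function names and the weight layout are hypothetical.

```python
import heapq


def dot_product_score(h_u, h_v):
    """Dot Product decoder: edge score is the inner product of the embeddings."""
    return sum(a * b for a, b in zip(h_u, h_v))


def hadamard_mlp_score(h_u, h_v, w1, b1, w2, b2):
    """HadamardMLP decoder: an MLP applied to the element-wise product.

    Assumption: one hidden layer with ReLU; `w1` is a list of rows,
    `b1` a list of biases, `w2` the output weights, `b2` the output bias.
    """
    z = [a * b for a, b in zip(h_u, h_v)]  # Hadamard product
    hidden = [
        max(0.0, sum(w * zi for w, zi in zip(row, z)) + bias)
        for row, bias in zip(w1, b1)
    ]
    return sum(w * h for w, h in zip(w2, hidden)) + b2


def top_k_neighbors(source, candidates, score_fn, k):
    """Brute-force retrieval: scores every candidate, so cost is linear in their number."""
    return heapq.nlargest(
        k, range(len(candidates)), key=lambda i: score_fn(source, candidates[i])
    )


if __name__ == "__main__":
    source = [1.0, 3.0]
    candidates = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
    print(top_k_neighbors(source, candidates, dot_product_score, 2))
```

For the Dot Product decoder, the brute-force loop can be replaced directly by a maximum inner product search index. For HadamardMLP the score is a nonlinear function of the element-wise product, which is why the abstract describes applying approximate MIPS progressively with adjusted query embeddings instead.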